This memorandum is a first draft of an essay on the simplest "learning"
processes. Comments are invited. Subsequent sections will treat, among
other things:

The "stimulus-sampling" model of Estes
Relations between Perceptron-type error
reinforcement and Bayesian-type correlation
reinforcement

and some other statistical methods viewed in the same way.



1.0 Decisions based on a fixed set of tests

The events to be classified lie in a set $\{F_j\}$ of classes.




We are given also a set $\{\varphi_i(X)\}$ of "tests" or "experiments" that can be
performed on the event that has just occurred. Define

$$\Phi(X) = (\varphi_1(X), \varphi_2(X), \ldots, \varphi_m(X))$$

to be the sequence of results of these tests, taken in some fixed order.

Usually we will restrict each $\varphi$ to be Boolean--that is, to have values 0 or 1.

Thus the machine M cannot "see" the real events "directly," but only
through the results of the experiments. There is really no alternative
to this sort of indirectness, because only mystics believe in the
possibility of the immediate and direct experience of reality -- and
whether or not this is in fact desirable, they are for some reason
unable to transmit to others their grounds for believing this.

Given the outcome $\Phi(X)$ of a set of experiments, we would like to guess
which class $F$ contains the event X that has just occurred. In some situations
we can be sure which $F$ was responsible; for example, in the case that there
is only one X that could have produced this particular value of $\Phi$. (It is
convenient to think of $\Phi = (\varphi_1, \varphi_2, \ldots, \varphi_m)$ as a vector, so that we can talk
about its value instead of the sequence of values of all the $\varphi$'s.)

But we cannot usually be sure which $F$ was responsible. We
shall consider the rather general case in which our knowledge about the
situation can be summarized in the form of a table, or distribution, of
probabilities:

$P(F_j/\Phi)$ = the probability that X is in $F_j$,
given that the experiments have produced the value $\Phi$.



Most of this chapter is devoted to discussing ways in which this sort of
information could be acquired through experience. First, however, we will say
a few words about how we could use this information if we had it.

1.1 Costs and decision criteria

If a particular $\Phi = \Phi_0$ occurs, and if we know that

$$P(F_j/\Phi_0) = 1, \qquad P(F_k/\Phi_0) = 0 \quad (k \neq j)$$

there is no question that $X \in F_j$, and we can say so definitely. But if
$0 < P < 1$ we have to guess. Usually, one will choose that $j$ for which
$P(F_j/\Phi_0)$ is largest. This guess will have the smallest chance of being
wrong; it is called a "maximum likelihood criterion."
But in real situations, just trying to be right is not the only goal that
may have to be considered. Different kinds of mistakes can have grossly
different consequences. It can be more important to avoid the tiger than to
acquire the lady. Let

$c_{jk}$ = the cost of guessing $F_k$ when X is really in $F_j$.

Then the "expected" cost, if we guess $F_k$ (given $\Phi$), is

$$C(k, \Phi) = \sum_j c_{jk}\, P(F_j/\Phi).$$



That is, this is the average price one will pay, over a large collection of
experiments, if one has the policy of guessing $F_k$ whenever one sees that
particular $\Phi$. Clearly, our policy then should be to choose that $k$ for which
$C(k, \Phi)$ is smallest.
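
The minimum-expected-cost rule can be sketched in a few lines. This is only an illustration: the function names and the tiger/lady cost figures are assumptions made for the example, with `cost[j][k]` standing for $c_{jk}$ and `posterior[j]` for $P(F_j/\Phi)$.

```python
def expected_costs(cost, posterior):
    """Return C(k, Phi) for every guess k.

    cost[j][k]   -- cost of guessing F_k when X is really in F_j
    posterior[j] -- P(F_j / Phi) for the Phi that was observed
    """
    n = len(posterior)
    return [sum(cost[j][k] * posterior[j] for j in range(n)) for k in range(n)]


def best_guess(cost, posterior):
    """Index k of the guess with the smallest expected cost."""
    costs = expected_costs(cost, posterior)
    return min(range(len(costs)), key=costs.__getitem__)


# Hypothetical two-class illustration: class 0 = "tiger", class 1 = "lady".
posterior = [0.2, 0.8]
cost = [[0.0, 100.0],  # really a tiger: guessing "lady" is very costly
        [1.0, 0.0]]    # really a lady: guessing "tiger" costs a little
```

With these assumed costs the rule guesses class 0 (the tiger) even though class 1 is four times as probable. With a cost of 1 for every mistake and 0 for every correct guess it chooses the most probable class instead.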

Only in the most peculiar circumstances would it be that a correct guess
$c_{ii}$ would cost more than any incorrect guess $c_{ij}$ $(i \neq j)$. But one might have
a situation in which

$$c_{j1} = 0 \quad (\text{all } j)$$

$$c_{jj} = 0 \quad (\text{all } j)$$

$$c_{jk} = 1 \quad (k \neq 1,\ j \neq k)$$

so that one might as well always guess $F_1$ (regardless of the value of $\Phi$), since
there is no premium on being right otherwise. In the case that

$$c_{jk} = 1 \quad (j \neq k), \qquad c_{jj} = 0$$

we have

$$\sum_j c_{jk}\, P(F_j/\Phi) = 1 - P(F_k/\Phi)$$

(since $\sum_j P(F_j/\Phi) = 1$), and minimizing this is, as it should be, the same as
maximizing $P(F_k/\Phi)$, i.e., the maximum likelihood strategy.

1.2 Inverting probabilities: Bayes' theorem

We have described decisions based upon knowledge of the probabilities
$P(F_j/\Phi)$. Usually, one starts with access to a different set, $P(\Phi/F_j)$, of
probabilities; $P(\Phi_0/F_j)$ is the probability that $\Phi_0$ will occur if $X \in F_j$. For
example, understanding of the mechanisms by which X's in each
$F$ affect the devices that compute the $\varphi$'s may permit a theoretical
calculation of what the $P(\Phi/F_j)$'s ought to be. Fortunately, one can calculate
the $P(F_j/\Phi)$'s from these as follows: By the definition of the conditional
probability symbol



$$P(B/A) = \frac{P(B \wedge A)}{P(A)}$$

where $P(B \wedge A)$ is the "joint" probability that both events B and A occur. But
then, also

$$P(A/B) = \frac{P(B \wedge A)}{P(B)}$$

because $P(A \wedge B) = P(B \wedge A)$.
Therefore

$$P(F_j/\Phi) = \frac{P(\Phi/F_j) \cdot P(F_j)}{P(\Phi)}.$$



Since $P(\Phi) = \sum_k [P(F_k)\, P(\Phi/F_k)]$ is the probability that $\Phi$ will occur, regardless
of which $F_k$ occurs, we can write

$$P(F_j/\Phi) = \frac{P(\Phi/F_j) \cdot P(F_j)}{\sum_k P(\Phi/F_k) \cdot P(F_k)} \qquad \text{(A)}$$

showing that we can calculate the $P(F_j/\Phi)$'s if we know the $P(\Phi/F_j)$'s and
also the "a priori" probabilities $P(F_j)$ of occurrences of X's of the different
types. These are, of course, determined by the circumstances, and cannot be
calculated from a theory of how the $\varphi$ devices work.
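
As a small worked instance of formula (A), the sketch below inverts a pair of likelihoods into a posteriori probabilities using exact fractions. The function name and the numbers are hypothetical, chosen only so the arithmetic can be checked by hand.

```python
from fractions import Fraction


def posteriors(likelihoods, priors):
    """Apply formula (A): turn P(Phi/F_j) and P(F_j) into P(F_j/Phi)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]  # P(Phi/F_j) * P(F_j)
    total = sum(joint)                                    # P(Phi)
    return [j / total for j in joint]


# Hypothetical two-class example for one observed Phi.
likelihoods = [Fraction(1, 2), Fraction(1, 10)]  # P(Phi/F_1), P(Phi/F_2)
priors = [Fraction(1, 4), Fraction(3, 4)]        # P(F_1), P(F_2)
post = posteriors(likelihoods, priors)           # [5/8, 3/8]
```

Here the joint terms are 1/8 and 3/40, their sum is 1/5, and so the posteriors are 5/8 and 3/8: the rarer class wins because its likelihood is five times larger.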

The simple formula (A) is useful when one wants, so to speak, to invert
the roles of "cause" and "effect" when one has either data or theory that goes
in the wrong direction. We note that in many applications one needs only to
know the relative magnitudes of the different $P(F_j/\Phi)$'s, and since the
denominator of (A) does not depend on $j$, one merely has to compute the value
of $j$ for which

$$P(\Phi/F_j) \cdot P(F_j) \qquad \text{(A')}$$

is the largest.



1.3 Problems of implementing the probability approach

There are a variety of practical and theoretical objections to the use
of formula (A). It is difficult, even in the simplest situations, to know
enough about the $P(\Phi/F_j)$'s to use this method. One might be able to
deduce them from a theoretical model of the situation; but then one would most
likely be able to construct a more sophisticated decision method involving
less computation. If there are many possible $\Phi$'s, then it becomes impractical
to estimate the $P(\Phi/F)$'s on the basis of empirical observation. The amount of
data would be too enormous. And the system is completely unable to "guess"
on $\Phi$'s it has not seen before. Finally, even if one had a complete catalog of
the $P(\Phi/F)$'s, knowledge in this form is so free of structure that it would be
very hard to adapt it to a similar, new situation. To be useful, knowledge
has to be cast into structured models.

The simplest and most common kinds of models are the "parametric
distributions." In this family of techniques, which includes many standard
methods of statistics, one has a procedure in which one

(a) assumes some form for the distributions of the $P(\Phi/F)$'s,

(b) fits the data to estimate the parameters of these distributions,

(c) designs a decision procedure based on the theory of the assumed
distribution forms.
Example: One thinks of the $\Phi = (\varphi_1, \ldots, \varphi_m)$ as points in a vector
space with the usual Cartesian distance metric.

Assume that each set $\{\Phi(X) \mid X \in F_j\}$ forms a "cluster" with (say)
a symmetric normal distribution.

Each cluster has (say) the same variance (concentration) but
different means (centers).

Then one can use the data to estimate the variance and means.
In this model, the decision of which $F$ a given $\Phi$ should be
assigned to can be made by finding which F-mean is closest to $\Phi$.
This, and many other related strategies, are discussed at length
in Nilsson's "Learning Machines," a book on the theory of a variety
of statistical and threshold decision methods.
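
The nearest-mean decision of this example can be sketched as follows. The sample points and helper names are made up for illustration, and the sketch additionally assumes that the classes are equally likely a priori, which the example does not state.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def sq_dist(a, b):
    """Squared Cartesian distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_mean(phi, means):
    """Label of the class whose estimated mean is closest to phi."""
    return min(means, key=lambda label: sq_dist(phi, means[label]))


# Hypothetical training samples for two clusters.
samples = {
    "F1": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "F2": [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0)],
}
means = {label: centroid(pts) for label, pts in samples.items()}
```

Step (b) of the parametric procedure is the call to `centroid`; step (c) is `nearest_mean`. Because the clusters are assumed to share one symmetric variance, that variance does not affect which mean is closest and is not estimated here.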

The trouble with the parametric models, and their relatives, is that even
those that are the most sophisticated contain so little
structure that they are usually thoroughly unsuitable for representing
detailed knowledge about anything. (For example, they cannot satisfactorily
represent finite-state processes.)

Nevertheless, we shall proceed to study the simplest such model,
in which the $\varphi$'s are statistically independent. Our conviction is that unless
this simple "linear" case is thoroughly understood, one can have little
chance of making good "intuitive" judgements about more complicated systems.

1.4 Independence

We can evade some of the problems mentioned in §1.3 if we can assume
that the tests $\varphi_i(X)$ are statistically independent for each $F_j$. Mathematically
this means that for any sequence $\Phi(X) = (\varphi_1(X), \ldots, \varphi_m(X))$ of values of $\varphi$ we
can assert that

$$P(\varphi_1(X) \wedge \cdots \wedge \varphi_m(X) / F_j) = P(\varphi_1(X)/F_j) \cdots P(\varphi_m(X)/F_j)$$

for each $j$. More compactly we say

$$P(\Phi/F_j) = \prod_i P(\varphi_i/F_j) \qquad \text{(B)}$$

Informally, given that $F_j$ has occurred, $P(\varphi_i/F_j)$ gives the distribution
of each of the $\varphi$'s. If one is further told the values of some of the $\varphi$'s,
independence means that this gives absolutely no further information about the
values of the remaining $\varphi$'s! We want to emphasize that this is a most
stringent condition.

Experimentally one might get independence when the variations in
responses of the $\varphi$'s result from "noise" or
measurement uncertainties within the $\varphi$-mechanisms. One would not then expect
the values of some $\varphi$'s to help predict those of others.

But where the variations in a $\varphi$, given $F_j$, are due to selection of different X's
from the same F-class, one would not ordinarily assume independence, since the
value of one of the $\varphi$'s tells something about which X in $F$ has occurred, and
hence could help at least partly to predict how another $\varphi$
will behave.



An extreme example of non-independence is the following, in which there
are two functions, $\varphi_1$ and $\varphi_2$, and two classes, $F_1$ and $F_2$.
Let

$\varphi_1(X)$ = a pure random variable with $P(\varphi_1(X) = 1) = \frac{1}{2}$.
Its value is determined by tossing a coin,
not by X.

$$\varphi_2(X) = \begin{cases} \varphi_1(X) & \text{if } X \in F_1 \\ 1 - \varphi_1(X) & \text{if } X \in F_2. \end{cases}$$

Notice that neither $\varphi_1$ nor $\varphi_2$ taken alone gives any information whatever about
$F$! Each appears to be a random coin toss. But from both one can determine
perfectly which $F$ has occurred, for

$$\varphi_1 = \varphi_2 \Rightarrow F_1, \qquad \varphi_1 \neq \varphi_2 \Rightarrow F_2$$

with absolute certainty.
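
This example is small enough to check by exhaustive enumeration. The sketch below builds the joint distribution of (class, $\varphi_1$, $\varphi_2$) with exact fractions; it assumes, beyond what the text states, that the two classes are equally likely a priori, and the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# Joint distribution over (class, phi1, phi2).
# Assumption for illustration: P(F_1) = P(F_2) = 1/2.
joint = {}
for cls, phi1 in product((1, 2), (0, 1)):
    phi2 = phi1 if cls == 1 else 1 - phi1
    joint[(cls, phi1, phi2)] = half * half  # class prior times coin toss


def prob(event):
    """Total probability of the (class, phi1, phi2) triples satisfying event."""
    return sum(p for key, p in joint.items() if event(*key))


def posterior_f1(evidence):
    """P(F_1 / evidence), where evidence is a predicate on (phi1, phi2)."""
    both = prob(lambda c, a, b: c == 1 and evidence(a, b))
    return both / prob(lambda c, a, b: evidence(a, b))
```

Calling `posterior_f1(lambda a, b: a == 1)` or `posterior_f1(lambda a, b: b == 1)` leaves the probability of $F_1$ at 1/2, while `posterior_f1(lambda a, b: a == b)` gives 1 and `posterior_f1(lambda a, b: a != b)` gives 0.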






Remark: We will assume only independence within each class $F_j$.
If X ranges over several F's then knowing one $\varphi$-value can help
guess another. For example, suppose that

$$\varphi_1 = \varphi_2 = 0 \ \text{ if } X \in F_1, \qquad \varphi_1 = \varphi_2 = 1 \ \text{ if } X \in F_2.$$

The two $\varphi$'s can be (in fact are) independent on each $F$. But if we
did not know that $X \in F_1$, and we were then told that $\varphi_1 = 0$, we could
indeed then predict that $\varphi_2 = 0$ also, without this violating our
independence assumption. If we had been told that $X \in F_1$, we could have
already predicted the value of $\varphi_2$, and in that case learning the
value of $\varphi_1$ would not have helped!

2.0 The linear maximum likelihood estimator

We will assume that the $\varphi$'s are independent (for each $F_j$). We have just
observed a case in which $\Phi = (\varphi_1, \ldots, \varphi_m)$ and we want to know which $F$ has
the greatest probability: which is the largest of

$$P(F_1/\Phi), \ldots, P(F_n/\Phi)?$$

Now, using formula (A') of §1.2, this is equivalent to choosing the $j$ for which
$P(\Phi/F_j) \cdot P(F_j)$ is largest. For brevity, define

$$p_{ij} = P(\varphi_i = 1/F_j), \qquad p_j = P(F_j).$$

We will discuss only Boolean $\varphi$'s; that is, each $\varphi$ has value 0 or 1.
Then, using (B) of §1.4, we can define

$$P(\Phi/F_j) \cdot P(F_j) = p_j \prod_{\varphi_i = 1} p_{ij} \prod_{\varphi_i = 0} (1 - p_{ij})$$

as the quantity to be maximized. It is formally convenient to multiply and
divide by $(1 - p_{ij})$ to obtain

$$p_j \prod_{\varphi_i = 1} \frac{p_{ij}}{1 - p_{ij}} \prod_{\text{all } i} (1 - p_{ij}) \qquad \text{(C)}$$

for then the product over all $i$ does not depend on the outcome $\Phi$, and
the influence of the actual experiment is concentrated in the first product
expression. The terms

$$\frac{p_{ij}}{1 - p_{ij}}$$

are the "odds" or "likelihood ratios" of getting $\varphi_i = 1$ if $X \in F_j$.

It is formally convenient now to replace (C) by its logarithm
because sums are more familiar than products. This changes only
the form, not the content; since log(x) increases when x does, we still select
the maximum of

$$\log p_j + \sum_{\varphi_i = 1} w_{ij} + \sum_{\text{all } i} \log (1 - p_{ij})$$

where $w_{ij} = \log \frac{p_{ij}}{1 - p_{ij}}$. The term $w_{ij}$
is aptly called by I. J. Good the "weight of the evidence" of $\varphi_i$ in favor of $F_j$. Now we
can write

$$\sum_{i=1}^{m} w_{ij}\, \varphi_i + k_j, \qquad k_j = \log p_j + \sum_{\text{all } i} \log (1 - p_{ij}) \qquad \text{(D)}$$

where all that does not depend on the outcome of the actual experiment $\Phi$ is
absorbed into the constant $k_j$, which is therefore a sort of "a priori" weight
for $F_j$ (as opposed to the $\sum w_{ij} \varphi_i$ term which is "a posteriori").

The important conclusion is that the decision can be made upon the basis
of a linear combination of the terms. This is due directly to the assumption
we made about the independence of the $\varphi_i$'s for each $j$.

There are slight asymmetries in our formula that come from
treating $\varphi_i = 1$ as an occurrence of an event and $\varphi_i = 0$ as a
non-occurrence of the same event. This is quite arbitrary, and makes
it hard to return to the more general situation in which the
experiments $\{\varphi_i\}$ could each have many values. Besides, our
algebra introduces a quite unnecessary risk of dividing by
zero! But on the whole, we gain an heuristically clearer
picture and, with this insight, it would be easy enough to
return to the original formulae for repairs.

Formally, one can go slightly further in simplifying the
formulae. Create a new "experiment" $\varphi_0$ whose value is always
1. Define $w_{0j} = k_j$, the constant of formula (D). Then if we re-define the vector $\Phi$ to be
$(\varphi_0, \varphi_1, \ldots, \varphi_m)$ and define vectors $W_j = (w_{0j}, w_{1j}, \ldots, w_{mj})$ our
procedure is: choose that $j$ for which

$$W_j \cdot \Phi$$

is maximal. Later we will give a more-or-less meaningful
geometric interpretation to this vector product formalism,
but right here it is not very illuminating.
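
The vector form of the decision rule can be sketched as below. The probabilities are hypothetical, and `direct_log_score` is included only to check that the linear form agrees with the logarithm of $P(\Phi/F_j) \cdot P(F_j)$ computed straight from the product; all names are illustrative.

```python
import math


def weight_vector(p_col, prior):
    """W_j = (w_0j, w_1j, ..., w_mj), with w_0j the a priori constant k_j."""
    k = math.log(prior) + sum(math.log(1 - p) for p in p_col)
    return [k] + [math.log(p / (1 - p)) for p in p_col]


def score(w, phi):
    """Dot product of W_j with the augmented vector (1, phi_1, ..., phi_m)."""
    augmented = (1,) + tuple(phi)  # phi_0 is always 1
    return sum(wi * x for wi, x in zip(w, augmented))


def direct_log_score(phi, p_col, prior):
    """log of P(Phi/F_j) * P(F_j), computed from the product formula."""
    s = math.log(prior)
    for p, x in zip(p_col, phi):
        s += math.log(p if x == 1 else 1 - p)
    return s


# Hypothetical p_ij (one list per class) and a priori probabilities p_j.
P = {"F1": [0.9, 0.2, 0.6], "F2": [0.3, 0.7, 0.5]}
PRIOR = {"F1": 0.4, "F2": 0.6}
W = {j: weight_vector(P[j], PRIOR[j]) for j in P}


def classify(phi):
    """Choose the class j for which W_j . Phi is maximal."""
    return max(W, key=lambda j: score(W[j], phi))
```

Note that `weight_vector` fails if any $p_{ij}$ is exactly 0 or 1, which is the division-by-zero risk mentioned in the text.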

2.1 Layer-Machines

Formula (D) of §2.0 suggests the design of a machine for making our
decision, in which D is a device that simply decides which of its inputs is the largest.
Each $\varphi$-device emits a standard-sized pulse (if $\varphi(X) = 1$) when X is presented.
The pulses are multiplied by the $w_{ij}$ quantities as indicated, and summed at the
$\Sigma$ boxes.



Returning for the moment to costs, if we combine the observations of
§1.1 and §2.0 we will want to minimize (for $k$)

$$\sum_j c_{jk}\, p_j \prod_{\varphi_i = 1} V_{ij} \prod_{\text{all } i} (1 - p_{ij})$$

where $V_{ij} = p_{ij}/(1 - p_{ij})$. It is interesting that this more complicated procedure
also lends itself to the layer structure.



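2.1 The two-class case

If there are only two classes $F_1$ and $F_2$ the decision formula becomes
very simple; we choose $F_1$ if

$$\sum_i w_{i1} \varphi_i + k_1 > \sum_i w_{i2} \varphi_i + k_2,$$

i.e., when

$$\sum_i (w_{i1} - w_{i2})\, \varphi_i > k_2 - k_1 \qquad \text{(A)}$$

which has the form

$$\sum_i a_i \varphi_i > \theta. \qquad \text{(B)}$$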

A decision function of this sort is called a "linear threshold function"
because the choice depends upon whether the linear function $\sum_i a_i \varphi_i$ exceeds the
"threshold" $\theta$. It is certainly one of the simplest ways to make a decision that
is not entirely trivial.
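
A linear threshold function built from the independence model might look like the following sketch. The probabilities are hypothetical and the helper names are illustrative; `a` and `theta` are computed from the weights $w_{ij} = \log(p_{ij}/(1-p_{ij}))$ and constants $k_j = \log p_j + \sum_i \log(1 - p_{ij})$ of the independent case.

```python
import math


def log_odds(p):
    """Weight of evidence log(p / (1 - p))."""
    return math.log(p / (1 - p))


def prior_constant(p_col, prior):
    """k_j = log p_j + sum_i log(1 - p_ij)."""
    return math.log(prior) + sum(math.log(1 - p) for p in p_col)


def threshold_form(p1, p2, prior1, prior2):
    """Return (a, theta): choose F_1 exactly when sum(a_i * phi_i) > theta."""
    a = [log_odds(x) - log_odds(y) for x, y in zip(p1, p2)]
    theta = prior_constant(p2, prior2) - prior_constant(p1, prior1)
    return a, theta


def choose(phi, a, theta):
    """1 if the linear function exceeds the threshold, else 2."""
    return 1 if sum(ai * x for ai, x in zip(a, phi)) > theta else 2


# Hypothetical probabilities p_i1, p_i2 and a priori probabilities 0.4, 0.6.
P1, P2 = [0.9, 0.2, 0.6], [0.3, 0.7, 0.5]
A, THETA = threshold_form(P1, P2, 0.4, 0.6)
```

Once `A` and `THETA` are fixed, the decision needs only one weighted sum and one comparison; nothing about the probabilities themselves has to be kept.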

Notice that we derived the formula on the basis of some very strong
assumptions, notably the independence of the $\varphi_i$'s. These will not
be true in general, and indeed it will be exceptional that the
conditions will be close enough to make the formula (A) useful.
The formula (B) is somewhat more general--and therefore somewhat
more widely useful--in that, while it retains the linear threshold
form, it does not require that $a_i$ be precisely

$$a_i = w_{i1} - w_{i2} = \log \frac{p_{i1}}{1 - p_{i1}} - \log \frac{p_{i2}}{1 - p_{i2}}$$

and suggests that even if the statistical assumptions of independence
do not hold, there might be other values of $a_i$ that could be used.
This is sometimes true.

3.0 Estimating the $p_{ij}$'s

3.1 Laplace estimation 

The most obvious way to estimate $p_{ij}$ is to present some events in $F_j$ and
record the value of $\varphi_i$. If N events occur and we observe H occurrences of
$\varphi_i = 1$, we can estimate that

$$p_{ij} \approx \frac{H}{N}.$$

Naturally, this estimate will give different values for different samples;
it has a statistical distribution. In fact, it will have a binomial distribution
about the true mean, with variance

$$\sigma^2 = \frac{p(1-p)}{N}.$$



We will not derive this well-known result, but most readers will
recognize that it is plausible because

(i) if N = 1 the distribution has two points whose weights
and distances from $p$ give the variance

$$\sigma^2 = (1-p)\,p^2 + p\,(1-p)^2 = p(1-p),$$

(ii) one is used to the uncertainty decreasing with the
square root of sample size,

(iii) the binomial distribution is so simple and fundamental
that one would not expect any other factor to enter.

(end of mathematical joke)
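
For readers who prefer a check to a joke, the variance $p(1-p)/N$ can be confirmed exactly for small N by summing over the binomial distribution. This is a sketch with an illustrative function name; it uses exact fractions so that the comparison involves no rounding.

```python
from fractions import Fraction
from math import comb


def estimate_variance(p, n):
    """Exact variance of the estimate H/N when H is binomial(N, p)."""
    probs = [comb(n, h) * p ** h * (1 - p) ** (n - h) for h in range(n + 1)]
    mean = sum(pr * Fraction(h, n) for h, pr in enumerate(probs))
    return sum(pr * (Fraction(h, n) - mean) ** 2 for h, pr in enumerate(probs))
```

For example, `estimate_variance(Fraction(1, 3), 7)` equals `Fraction(1, 3) * Fraction(2, 3) / 7` exactly.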

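To use this we have to keep a record of N. To "update" the estimate
for each observation, denote the estimate of $p$ after N observations by $\hat{p}_N$.
If the next $\varphi = 1$ then

$$\hat{p}_{N+1} = \frac{N \hat{p}_N + 1}{N+1} = \left(1 - \frac{1}{N+1}\right) \hat{p}_N + \frac{1}{N+1}$$

while if $\varphi = 0$,

$$\hat{p}_{N+1} = \left(1 - \frac{1}{N+1}\right) \hat{p}_N.$$

Both can be summarized by

$$\hat{p}_{N+1} = \left(1 - \frac{1}{N+1}\right) \hat{p}_N + \frac{1}{N+1}\, \varphi.$$

This suggests another way to estimate $p$! What would happen if
we replaced the multipliers $(1 - \frac{1}{N+1})$ and $\frac{1}{N+1}$ by constants that don't depend on
N, the number of trials? It is desirable to make these constants still add
to unity, so that the estimated "probabilities" will
also add to unity.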
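
The running form of the Laplace estimate can be sketched as follows; the function names and the observation sequence are illustrative. The point is that only the current estimate and the count N need to be stored, not the observations themselves.

```python
from fractions import Fraction


def laplace_update(p_hat, n, phi):
    """Fold the (n+1)-th observation phi (0 or 1) into the running estimate."""
    step = Fraction(1, n + 1)
    return (1 - step) * p_hat + step * phi, n + 1


def running_estimate(observations):
    """Return (estimate, N) after processing all observations in order."""
    p_hat, n = Fraction(0), 0
    for phi in observations:
        p_hat, n = laplace_update(p_hat, n, phi)
    return p_hat, n
```

Feeding it `[1, 0, 1, 1, 0, 1, 0, 1]` gives 5/8 after 8 trials, the same as counting five occurrences in eight events.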
3.2 Reinforcement estimation

We are estimating the probability that $\varphi(X) = 1$. After each event $\varphi^{[t]}$
we revise our previous estimate $\alpha^{[t-1]}$ by the formula

$$\alpha^{[t]} = \theta\, \alpha^{[t-1]} + (1 - \theta)\, \varphi^{[t]} \qquad \text{(R)}$$

where $\theta$ is a constant between 0 and 1. For simplicity we begin by setting
$\alpha^{[0]} = \varphi^{[0]}$. By applying the formula repeatedly we obtain

$$\alpha^{[n]} = \theta^n \varphi^{[0]} + (1 - \theta)\left[\varphi^{[n]} + \theta \varphi^{[n-1]} + \theta^2 \varphi^{[n-2]} + \cdots + \theta^{n-1} \varphi^{[1]}\right]$$

which we can write as

$$\alpha^{[n]} = \frac{\dfrac{\theta^n}{1 - \theta}\, \varphi^{[0]} + \sum_{t=0}^{n-1} \theta^t \varphi^{[n-t]}}{\sum_{t=0}^{\infty} \theta^t}$$

where the denominator is equal to $1/(1 - \theta)$. The last formula is written to
show more clearly that $\alpha^{[n]}$ can be expressed as a simple weighted sum, i.e.,
average, of the $\varphi$'s. We have two reasons for writing the sum in this form:

(1) Exponential decay of "memory"

The formula shows that the value of $\alpha$ at time n is

(a) a weighted average of the $\varphi$'s, and

(b) the weights, i.e., the influence of older events, fall
off exponentially. For we can write

$$\frac{\partial \alpha^{[n]}}{\partial \varphi^{[n-t]}} = (1 - \theta)\, \theta^{t} \qquad (0 \le t < n)$$

where the derivative can be interpreted as showing how much $\alpha^{[n]}$ depends on the
$\varphi$ that occurred t moments before. The first term, $\varphi^{[0]}$, always retains slightly
more weight (by a factor of $\frac{1}{1-\theta}$) but it, too, is subject to the same decay.

(2) The procedure is an estimator for $p$

This follows from the general theorem that if $x_1, \ldots, x_n$ are
independent random variables and $E(x)$ is the mean or "expected" value of a
random variable, then

$$E\left(\sum_i a_i x_i\right) = \sum_i a_i\, E(x_i)$$

for any set $\{a_i\}$ of constants. Since each $\varphi$ has $E(\varphi) = p$, and the weights add to unity, we get

$$E(\alpha^{[n]}) = p.$$



The exponential form has advantages in some situations:

(1) We do not have to store the number N of trials, and

(2) The estimator lends itself to "adaptation" for changing situations.

On the other hand, it does not optimally use the experience in non-changing
situations.

We can appraise the efficiency of (R) in using its data by computing its
variance, the mean square error about $p$.
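
A short sketch of the exponential estimator is given here, together with its unrolled form as an exponentially weighted sum of past observations. The function names, the value $\theta = 3/4$ used in the usage note, and the observation sequence are illustrative assumptions; the starting value is the first observation, as in the simplification adopted above.

```python
from fractions import Fraction


def reinforce(alpha, phi, theta):
    """One step of the estimator: theta * alpha + (1 - theta) * phi."""
    return theta * alpha + (1 - theta) * phi


def run(phis, theta):
    """Start from the first observation and fold in the rest, one at a time."""
    alpha = phis[0]
    for phi in phis[1:]:
        alpha = reinforce(alpha, phi, theta)
    return alpha


def weighted_sum(phis, theta):
    """The same value written as an explicit weighted sum of the observations."""
    n = len(phis) - 1
    total = theta ** n * phis[0]
    for t in range(n):  # the observation t steps back has weight (1-theta)*theta**t
        total += (1 - theta) * theta ** t * phis[n - t]
    return total
```

With `theta = Fraction(3, 4)` and observations `[1, 0, 1, 1, 0]`, both functions return 165/256. Feeding a constant sequence of ones returns exactly 1, which confirms that the weights add to unity.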

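3.3 The variance of the reinforcement estimator (R)

Suppose that X has the distribution $f(X)$ with mean $\mu$ and is subject to
the transformation

$$X' = \theta X + (1 - \theta)\varphi$$

where prob$(\varphi = 1) = p$. Then the distribution of $X'$ is

$$f'(X) = p\,\alpha(X) + (1 - p)\,\beta(X)$$

where

$$\alpha(X) = \frac{1}{\theta}\, f\!\left(\frac{X - (1 - \theta)}{\theta}\right) \quad \text{and} \quad \beta(X) = \frac{1}{\theta}\, f\!\left(\frac{X}{\theta}\right). \qquad \text{(I)}$$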
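Then the variance of $f'(X)$ is (as shown below)

$$\sigma_{f'}^2 = p\,\sigma_\alpha^2 + (1-p)\,\sigma_\beta^2 + p(1-p)(\mu_\alpha - \mu_\beta)^2. \qquad \text{(II)}$$

If the variance of $f$ is $\sigma^2$ then from (I) we have

$$\sigma_\alpha^2 = \sigma_\beta^2 = \theta^2 \sigma^2, \qquad \mu_\alpha = \theta\mu + (1 - \theta), \qquad \mu_\beta = \theta\mu,$$

so

$$\mu_\alpha - \mu_\beta = 1 - \theta,$$

hence

$$\sigma_{f'}^2 = \theta^2 \sigma^2 + p(1-p)(1-\theta)^2.$$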



This is linear in $\sigma^2$ with slope $\theta^2 < 1$, hence iteration must lead to a limit
("fixed") point for the variance. At this limit we must have

$$\sigma^2 = \theta^2 \sigma^2 + p(1-p)(1-\theta)^2$$

or

$$\sigma^2 = p(1 - p)\, \frac{1 - \theta}{1 + \theta}.$$
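
The fixed-point argument can be checked numerically. The sketch below iterates the variance recursion $\sigma^2 \mapsto \theta^2\sigma^2 + p(1-p)(1-\theta)^2$ and compares the result with the closed form $p(1-p)(1-\theta)/(1+\theta)$; the function names and test values are illustrative.

```python
from fractions import Fraction


def next_variance(var, p, theta):
    """One step of the variance recursion for the reinforcement estimator."""
    return theta ** 2 * var + p * (1 - p) * (1 - theta) ** 2


def limit_variance(p, theta):
    """Closed-form fixed point of next_variance."""
    return p * (1 - p) * (1 - theta) / (1 + theta)


def iterate_variance(p, theta, steps, var=0):
    """Apply the recursion repeatedly, starting from the given variance."""
    for _ in range(steps):
        var = next_variance(var, p, theta)
    return var
```

With `p = Fraction(1, 3)` and `theta = Fraction(3, 4)` the limit is exactly 2/63, and one more application of `next_variance` leaves it unchanged.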



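Proof of (II):

$$\sigma_{f'}^2 = \int f'(X)\,(X - \mu_{f'})^2 = \int f'(X)\,X^2 - \mu_{f'}^2$$

$$= p \int \alpha(X)\,X^2 + (1-p) \int \beta(X)\,X^2 - \left[p\,\mu_\alpha + (1-p)\,\mu_\beta\right]^2$$

$$= p\left[\int \alpha(X)\,X^2 - \mu_\alpha^2\right] + p\,\mu_\alpha^2 - p^2 \mu_\alpha^2$$

$$\quad + (1-p)\left[\int \beta(X)\,X^2 - \mu_\beta^2\right] + (1-p)\,\mu_\beta^2 - (1-p)^2 \mu_\beta^2 - 2p(1-p)\,\mu_\alpha \mu_\beta$$

$$= p\,\sigma_\alpha^2 + (1-p)\,\sigma_\beta^2 + p(1-p)(\mu_\alpha - \mu_\beta)^2.$$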



3.4 The role of $\theta$ as a memory time-constant

When the reinforcement process (R) has been in operation for a very long
time, the variance of its estimate does not approach zero (as it does
for the Laplace estimate when N increases beyond bound). This is because it
depends less upon the remote past; indeed the very most recent term can alter
the current value by as much as $1 - \theta$, a constant. (In the uniform weight
process the effect is less than $\frac{1}{N}$, which becomes arbitrarily small.) If we
"equate" the variances

$$\frac{p(1-p)}{N} = p(1-p)\, \frac{1 - \theta}{1 + \theta}$$

we get

$$N = \frac{1 + \theta}{1 - \theta}.$$
For small values of $\theta$ this is a little more than unity, showing (correctly)
that the estimate is based almost entirely upon the last event. In fact, in
this case

$$N \approx 1 + 2\theta.$$

For large $\theta$, i.e., close to unity, we have

$$N \approx \frac{2}{1 - \theta}$$

so that if $\theta = 1 - \frac{1}{m}$ then the variance is about that one would obtain by
simple averaging of the last 2m samples! Thus one can think of the quantity
$\frac{1}{1-\theta}$ as corresponding roughly to a time constant for "forgetting."
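
The correspondence between $\theta$ and an equivalent number of uniformly weighted samples, $N = (1+\theta)/(1-\theta)$, is easy to tabulate. The function names below are illustrative.

```python
from fractions import Fraction


def equivalent_sample_size(theta):
    """N for which p(1-p)/N equals the limiting variance of the estimator."""
    return (1 + theta) / (1 - theta)


def theta_for_memory(m):
    """theta = 1 - 1/m, the setting discussed in the text."""
    return 1 - Fraction(1, m)
```

For $\theta = 1 - 1/m$ the equivalent sample size works out to exactly $2m - 1$, which is the "about 2m" of the text; `equivalent_sample_size(theta_for_memory(10))` is 19.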
3.5 The Samuel compromise

In his classical paper "Some Studies in Machine Learning Using
the Game of Checkers," Arthur L. Samuel uses an ingenious combination of
probability estimation methods. In his application it occasionally happens
that a new evidence term $\varphi_i$ is introduced (and an old one is abandoned because
it has not been of much value in the decision process). When this happens there
is a problem of preventing violent fluctuations, because after one or a few
trials the term's probability estimate will have a large variance as compared
with older terms that have better statistical records. Samuel uses the
following algorithm to "stabilize" his system: he sets $p^{[0]} = \frac{1}{2}$ and

$$p^{[t+1]} = \left(1 - \frac{1}{N}\right) p^{[t]} + \frac{1}{N}\, \varphi^{[t]}$$

where

N = 16 if t < 32

N = $2^n$ if $2^n \le t < 2^{n+1}$

N = 256 if $256 \le t$.



Thus, in the beginning the estimate is made as though the probability had
already been estimated to be $\frac{1}{2}$ on the basis of several, i.e., the order of 16,
trials. Then, in the "middle" period, the algorithm approximates the uniform
weighting procedure. Finally (when $t \ge 256$) the procedure changes to the
exponential decay mode, with fixed N, so that recent experience can outweigh
earlier results. The use of powers of two represents a convenient
computer-program technique for doing this.
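
The schedule for N and the resulting update can be sketched as below. This follows the rule as stated in this section (N = 16 for t < 32, the largest power of two not exceeding t for 32 ≤ t < 256, and 256 from then on); the function names are illustrative and the code is not taken from Samuel's program.

```python
from fractions import Fraction


def samuel_n(t):
    """The divisor N used at trial t under the three-phase schedule."""
    if t < 32:
        return 16
    if t >= 256:
        return 256
    return 1 << (t.bit_length() - 1)  # largest power of two <= t


def samuel_update(p, phi, t):
    """One revision of the estimate at trial t."""
    n = samuel_n(t)
    return (1 - Fraction(1, n)) * p + Fraction(1, n) * phi


def samuel_estimate(phis):
    """Start from 1/2 and fold in the observations phi[0], phi[1], ..."""
    p = Fraction(1, 2)
    for t, phi in enumerate(phis):
        p = samuel_update(p, phi, t)
    return p
```

A single observation of 1 moves the estimate only from 1/2 to 17/32, which is the damping of early fluctuations that the schedule is meant to provide.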






In Samuel's system, the terms actually used have the form

$$C = 2p - 1$$

so that the "estimator" ranges in the interval $-1 \le C \le +1$ and can
be treated as a "correlation coefficient." I mention this here only
to justify Samuel's initial setting of $p^{[0]}$ to $\frac{1}{2}$, i.e., of C to 0. In
his context, this setting makes perfect sense, whereas in our
interpretation the setting of $p^{[0]}$ to $\frac{1}{2}$ would be arbitrary.

3.6 Variants of the reinforcement estimator

Consider the "reinforcement process"

$$\alpha' = \theta\alpha + (1 - \theta)\varphi \qquad (R_1)$$

If the distribution of $\alpha$ has mean $\mu$ then the distribution of $\alpha'$ will have mean

$$\mu' = (1-p)\,\theta\mu + p\,(\theta\mu + (1 - \theta))$$

because there is probability $(1-p)$ that $\varphi = 0$ and probability $p$ that $\varphi = 1$. Then

$$\mu' = \theta\mu + (1 - \theta)\,p.$$

Applying this again we get for the mean of $(\alpha')'$,

$$\mu'' = \theta(\theta\mu + (1-\theta)p) + (1-\theta)p$$

$$= \theta^2\mu + (1-\theta)\,p\,(1+\theta)$$

$$= \theta^2\mu + (1-\theta^2)\,p$$

and similarly, if we apply $R_1$ n times, we get

$$\mu^{(n)} = \theta^n \mu + (1 - \theta^n)\,p.$$

Clearly as $n \to \infty$, $\mu^{(n)} \to p$.

This analysis can be replaced by a more general method, by using the
following two simple observations:

Lemma 1: If $\alpha' = f(\alpha, \varphi)$ is linear in each of $\alpha$ and $\varphi$, then if $\mu(\alpha) = m$ and $P(\varphi = 1) = p$ then

$$\mu(\alpha') = f(m, p).$$

Proof: $\mu(\alpha') = (1-p)\, f(m,0) + p\, f(m,1)$

$= f(m,0) + p\,[f(m,1) - f(m,0)] = f(m,p).$

Lemma 2: If $\left|\frac{\partial f}{\partial \alpha}\right| < 1$ then the limit of $\alpha, f(\alpha), f(f(\alpha)), \ldots, f^{(n)}(\alpha)$ exists
and is the (unique) solution of the equation

$$y = f(y, p).$$

Proof: This is a "fixed point theorem" and the diagrams (not reproduced here) show why it
is true.

Now we can apply these to the formula $R_1$, and have only to solve

$$y = \theta y + (1 - \theta)\,p$$

or

$$y = \frac{1 - \theta}{1 - \theta}\, p = p.$$

But with the lemmas, we can also analyse some other systems. Another
interesting one is

$$\alpha' = \theta(\alpha + \varphi) \qquad (R_2)$$

and we have

$$y = \theta(y + p)$$

or

$$y = \frac{\theta}{1 - \theta}\, p$$

so that this, too, is an estimator for $p$, which has to be corrected
by the constant factor $\frac{1-\theta}{\theta}$.

Another, somewhat different reinforcer is

$$\alpha' = \begin{cases} \alpha + 1 & \text{if } \varphi = 1 \\ \theta\alpha & \text{if } \varphi = 0 \end{cases} \qquad (R_3)$$

which can be written as

$$\alpha' = \varphi + \alpha(\varphi + \theta - \varphi\theta).$$



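This satisfies the conditions of the Lemmas (provided $p < 1$), since, with $\varphi$ replaced by its mean $p$,

$$\left|\frac{\partial f}{\partial \alpha}\right| = |p + \theta - p\theta| = |1 - (1-p)(1-\theta)| < 1$$

so we have

$$y = p + y\,(p + \theta - p\theta)$$

or

$$y = \frac{1}{1 - \theta} \cdot \frac{p}{1 - p}.$$

This is an estimator (with the $\frac{1}{1-\theta}$ correction factor) of the likelihood
ratio $\frac{p}{1-p}$. It is interesting that this is so easily obtained by a reinforcement
process as simple as: "if $\varphi = 1$ occurs, add 1 to $\alpha$; if $\varphi = 0$ occurs, multiply
$\alpha$ by $\theta$."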
Another simple form is

$$\alpha' = \theta\alpha + \varphi \qquad (R_4)$$

which leads to the estimate

$$y = \frac{1}{1 - \theta}\, p.$$

Finally, one might consider the very simple form

$$\alpha' = \alpha + \varphi \qquad (R_5)$$

This "diverges," i.e., the $\alpha$'s grow beyond bound (and do not satisfy the
condition for Lemma 2). Still, if one is making decisions by comparing
different $p_{ij}$'s, one can use the ratios of these simple "scores" as likelihood
ratios, to obtain the "uniform weighting" type of behavior. We include $R_5$
only to indicate its heuristic similarity to the others.
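
The limiting means of the convergent reinforcers can be checked by iterating the mean map $y \mapsto f(y, p)$, which is what the two lemmas of this section license. In the sketch below the dictionary names, the starting value 0, and the test values of $p$ and $\theta$ are illustrative; $R_5$ is omitted because it has no limit.

```python
from fractions import Fraction

# Each reinforcer as a function f(alpha, phi, theta).
REINFORCERS = {
    "R1": lambda a, phi, th: th * a + (1 - th) * phi,
    "R2": lambda a, phi, th: th * (a + phi),
    "R3": lambda a, phi, th: phi + a * (phi + th - phi * th),
    "R4": lambda a, phi, th: th * a + phi,
}

# The limiting mean of alpha predicted by solving y = f(y, p).
PREDICTED = {
    "R1": lambda p, th: p,
    "R2": lambda p, th: th * p / (1 - th),
    "R3": lambda p, th: p / ((1 - p) * (1 - th)),
    "R4": lambda p, th: p / (1 - th),
}


def mean_limit(f, p, theta, steps=2000):
    """Iterate y <- f(y, p), i.e. the mean map with phi replaced by its mean p."""
    y = 0.0
    for _ in range(steps):
        y = f(y, p, theta)
    return y
```

For instance, with $p = 0.3$ and $\theta = 0.8$ the iteration for $R_3$ settles near 2.1429, which is $\frac{1}{1-\theta}$ times the likelihood ratio $p/(1-p)$.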
3.7 A simple "synaptic" reinforcer theory 

Let us make a simple "neuronal model." The model is to "account" for the 
following phenomena: 

1. There is to be a quantity $\alpha_{ij}$ that estimates $p_{ij} = P(\varphi_i/F_j)$;

2. The only information available is the occurrence of $\varphi_i = 1$ and
of $F_j$.

The bag $B_{ij}$ contains a very high and constant concentration of a substance A.
When $\varphi_i$ or $F_j$ occur--or "fire"--the walls of the corresponding bags $B_{ij}$ and/or
$C_{ij}$ become "permeable" to A for a moment. If $\varphi_i$ alone occurs, nothing really
changes, because $B_{ij}$ is surrounded by the impermeable $C_{ij}$. If $F_j$ alone fires,
$C_{ij}$ loses some A by diffusion to the outside; in fact, if $\alpha$ is the amount of A
in $C_{ij}$ it may be assumed (by the usual laws of diffusion and concentration) to
lose some fraction $(1 - \theta)$ of $\alpha$:

$$\alpha' = \theta\alpha \quad \text{if } F_j \text{ occurs and } \varphi_i = 0.$$



If both $\varphi_i$ and $F_j$ are active then approximately the same loss will occur from
$C_{ij}$. But an essentially constant amount b will be "injected" from $B_{ij}$ to $C_{ij}$. So

$$\alpha' = \theta\alpha + b \quad \text{if } F_j \text{ occurs and } \varphi_i = 1.$$

We can assume that b is constant because the concentration of A is very high in
$B_{ij}$ compared to that in $C_{ij}$. Or one can invent any number of similar variations.
In any case we get

$$\alpha' = \theta\alpha + b\,\varphi_i$$
so that in the limit the mean of $\alpha$ will approach

$$\frac{b\,p}{1 - \theta}$$

which is proportional to, and hence an estimator of, $p = \text{Prob}(\varphi_i/F_j)$.

Thus the simple geometry together with the idea of a membrane becoming 
permeable briefly following a nerve impulse gives us a quantity that 
is an estimator of the appropriate probability. 
How could this representation of probability be translated into a useful
neuronal mechanism? One could imagine all sorts of schemes: ionic
concentrations--or rather, their logarithms--could become membrane potentials,
or conductivities, or even probabilities of occurrences of other chemical
events. The "anatomy" and "physiology" of our model could easily be modified
to obtain $R_3$ processes and their attendant "likelihood ratios." Indeed, it is
so easy to imagine variants--the idea is so insensitive to details--that I don't
propose it to be considered seriously, except as a family of simple yet
intriguing models that a neural theorist should have available.



